In [1]:
# importing libraries that I will be using.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import statsmodels.api as sm
import plotly.express as px

Border Crossing Dataset:

In [2]:
# reading in Border_Crossing_Entry_Data.csv and creating a dataframe.

border_crossing = pd.read_csv('Border_Crossing_Entry_Data.csv')

border_crossing
Out[2]:
Port Name State Port Code Border Date Measure Value Latitude Longitude Point
0 International Falls Minnesota 3604 US-Canada Border Oct 2023 Trucks 1372 48.608 -93.401 POINT (-93.401355 48.6078)
1 Sumas Washington 3009 US-Canada Border Oct 2023 Train Passengers 57 49.002 -122.265 POINT (-122.264805 49.002388)
2 Naco Arizona 2603 US-Mexico Border Sep 2023 Trucks 270 31.334 -109.948 POINT (-109.948413 31.334084)
3 Wildhorse Montana 3323 US-Canada Border Sep 2023 Trucks 42 48.999 -110.215 POINT (-110.215083 48.999361)
4 Calais Maine 115 US-Canada Border Sep 2023 Trains 15 45.189 -67.275 POINT (-67.275381 45.188548)
... ... ... ... ... ... ... ... ... ... ...
388818 Walhalla North Dakota 3407 US-Canada Border Mar 1996 Pedestrians 0 49.000 -97.908 POINT (-97.908416 49.000472)
388819 Norton Vermont 211 US-Canada Border Aug 1997 Rail Containers Loaded 1430 45.011 -71.793 POINT (-71.793219 45.010771)
388820 Point Roberts Washington 3017 US-Canada Border Jun 1996 Pedestrians 1469 49.002 -123.068 POINT (-123.068055556 49.0020555547)
388821 Blaine Washington 3004 US-Canada Border Feb 1996 Personal Vehicles 297396 48.994 -122.749 POINT (-122.748884 48.994045)
388822 Sherwood North Dakota 3414 US-Canada Border Jun 1997 Personal Vehicle Passengers 4304 48.999 -101.628 POINT (-101.627527 48.999305)

388823 rows × 10 columns

Let's start getting to know the data! This will tell us what we're working with, what we will and will not need, and whether any cleaning is necessary. I'll work through one dataset at a time.

In [3]:
border_crossing.info()

# Date column will need to be changed to datetime. Latitude, Longitude, and Point can be dropped as they will not be needed (we
# have port codes as well as the port name, state and border entry area).
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 388823 entries, 0 to 388822
Data columns (total 10 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   Port Name  388823 non-null  object 
 1   State      388823 non-null  object 
 2   Port Code  388823 non-null  int64  
 3   Border     388823 non-null  object 
 4   Date       388823 non-null  object 
 5   Measure    388823 non-null  object 
 6   Value      388823 non-null  int64  
 7   Latitude   388822 non-null  float64
 8   Longitude  388822 non-null  float64
 9   Point      388822 non-null  object 
dtypes: float64(2), int64(2), object(6)
memory usage: 29.7+ MB
In [4]:
# Finding the number of duplicates.

border_crossing.duplicated().sum()
Out[4]:
10
In [5]:
# Locating the duplicated rows so we can inspect them before deciding whether to drop them.

border_dupes = border_crossing.duplicated()

border_dupes2 = border_crossing[border_dupes]

border_dupes2
Out[5]:
Port Name State Port Code Border Date Measure Value Latitude Longitude Point
35075 Fort Kent Maine 110 US-Canada Border Jun 2019 Personal Vehicles 9615 47.249 -68.604 POINT (-68.603918 47.249206)
36829 Detroit Michigan 3801 US-Canada Border Jun 2019 Personal Vehicles 351889 42.332 -83.048 POINT (-83.047924 42.331685)
48977 Del Rio Texas 2302 US-Mexico Border Jun 2019 Personal Vehicles 123068 29.327 -100.928 POINT (-100.927612 29.326784)
50238 Del Rio Texas 2302 US-Mexico Border Jun 2019 Personal Vehicle Passengers 238721 29.327 -100.928 POINT (-100.927612 29.326784)
51398 Santa Teresa New Mexico 2408 US-Mexico Border Jun 2019 Personal Vehicles 50672 31.784 -106.679 POINT (-106.679437 31.783939)
52318 Roma Texas 2310 US-Mexico Border Jun 2019 Personal Vehicles 54953 26.404 -99.019 POINT (-99.018981 26.403928)
56226 Roma Texas 2310 US-Mexico Border Jun 2019 Personal Vehicle Passengers 104640 26.404 -99.019 POINT (-99.018981 26.403928)
59053 Detroit Michigan 3801 US-Canada Border Jun 2019 Personal Vehicle Passengers 599521 42.332 -83.048 POINT (-83.047924 42.331685)
59852 Santa Teresa New Mexico 2408 US-Mexico Border Jun 2019 Personal Vehicle Passengers 110576 31.784 -106.679 POINT (-106.679437 31.783939)
61335 Fort Kent Maine 110 US-Canada Border Jun 2019 Personal Vehicle Passengers 14056 47.249 -68.604 POINT (-68.603918 47.249206)

As we can see above, these are not actual duplicates, so we will keep them in the dataset.
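When deciding whether flagged rows are real duplicates, it can help to see every copy side by side rather than only the later occurrences. Passing keep=False to duplicated() does that — a minimal sketch on a made-up frame (the values are illustrative, not from the dataset):

```python
import pandas as pd

# Toy frame with one exact repeat (hypothetical values).
df = pd.DataFrame({
    "Port Name": ["Del Rio", "Del Rio", "Roma"],
    "Date": ["Jun 2019", "Jun 2019", "Jun 2019"],
    "Value": [123068, 123068, 54953],
})

# Default duplicated() flags only the second copy; keep=False marks
# *all* copies of a duplicated row so they can be inspected together.
all_copies = df[df.duplicated(keep=False)]
print(all_copies)
```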

In [6]:
# Finding the sum of all null values in the dataset.

border_crossing.isnull().sum()
Out[6]:
Port Name    0
State        0
Port Code    0
Border       0
Date         0
Measure      0
Value        0
Latitude     1
Longitude    1
Point        1
dtype: int64

No need to worry about the nulls here as those columns will be dropped.

In [7]:
border_crossing['State'].unique()
Out[7]:
array(['Minnesota', 'Washington', 'Arizona', 'Montana', 'Maine', 'Alaska',
       'Texas', 'North Dakota', 'Vermont', 'New York', 'Michigan',
       'New Mexico', 'California', 'Idaho'], dtype=object)
In [8]:
border_crossing['Port Name'].unique()
Out[8]:
array(['International Falls', 'Sumas', 'Naco', 'Wildhorse', 'Calais',
       'Anchorage', 'Roma', 'Fort Kent', 'Dunseith', 'Douglas', 'Norton',
       'Hansboro', 'Richford', 'Morgan', 'Beecher Falls', 'Alcan',
       'Laredo', 'Metaline Falls', 'Massena', 'Hidalgo', 'Detroit',
       'Del Rio', 'Limestone', 'Warroad', 'Nogales', 'Scobey', 'Opheim',
       'Eagle Pass', 'Turner', 'El Paso', 'Sweetgrass', 'St John',
       'Pembina', 'Roosville', 'Alexandria Bay', 'Columbus',
       'Brownsville', 'Oroville', 'Skagway', 'Sault Sainte Marie',
       'Madawaska', 'Van Buren', 'Houlton', 'Otay Mesa', 'Baudette',
       'Dalton Cache', 'Friday Harbor', 'Roseau', 'Highgate Springs',
       'Sherwood', 'Piegan', 'Lancaster', 'Lukeville', 'Eastport',
       'Porthill', 'Del Bonita', 'Portal', 'Progreso', 'Maida',
       'Fort Fairfield', 'Ambrose', 'Whitlash', 'Ferry', 'Grand Portage',
       'Andrade', 'San Luis', 'Nighthawk', 'Fortuna', 'Calexico East',
       'Laurier', 'Port Angeles', 'Raymond', 'Neche',
       'Champlain Rouses Point', 'Tornillo', 'Ogdensburg', 'San Ysidro',
       'Tecate', 'Bridgewater', 'Port Huron', 'Danville', 'Point Roberts',
       'Vanceboro', 'Ysleta', 'Blaine', 'Jackman', 'Boundary', 'Walhalla',
       'Rio Grande City', 'Carbury', 'Sarles', 'Presidio', 'Derby Line',
       'Antler', 'Santa Teresa', 'Kenneth G Ward',
       'Buffalo Niagara Falls', 'Pinecreek', 'Sasabe', 'Frontier',
       'Trout River', 'Westhope', 'Cross Border Xpress', 'Northgate',
       'Hannah', 'Willow Creek', 'Noonan', 'Calexico', 'Algonac',
       'Boquillas', 'Noyes', 'Portland', 'Ketchikan', 'Cape Vincent',
       'Bar Harbor', 'Anacortes', 'Whitetail'], dtype=object)
In [9]:
border_crossing['Measure'].unique()
Out[9]:
array(['Trucks', 'Train Passengers', 'Trains', 'Truck Containers Empty',
       'Pedestrians', 'Buses', 'Bus Passengers', 'Rail Containers Empty',
       'Truck Containers Loaded', 'Personal Vehicles',
       'Personal Vehicle Passengers', 'Rail Containers Loaded'],
      dtype=object)
In [10]:
border_crossing.nsmallest(8, 'Value')
Out[10]:
Port Name State Port Code Border Date Measure Value Latitude Longitude Point
118 Dalton Cache Alaska 3106 US-Canada Border Feb 2017 Pedestrians 0 59.451 -136.362 POINT (-136.361944 59.450556)
121 Friday Harbor Washington 3014 US-Canada Border Jan 2017 Trucks 0 48.534 -123.016 POINT (-123.016484 48.534485)
122 Roosville Montana 3318 US-Canada Border Dec 2016 Buses 0 49.000 -115.056 POINT (-115.056027 48.999638)
125 Roseau Minnesota 3426 US-Canada Border Oct 2016 Buses 0 49.000 -95.766 POINT (-95.766469 48.999538)
128 Sherwood North Dakota 3414 US-Canada Border Jul 2016 Trains 0 48.999 -101.628 POINT (-101.627527 48.999305)
129 Morgan Montana 3319 US-Canada Border Jul 2016 Buses 0 49.000 -107.832 POINT (-107.831819 48.999829)
133 Metaline Falls Washington 3025 US-Canada Border Jun 2016 Train Passengers 0 49.000 -117.299 POINT (-117.299444 48.999972)
135 Turner Montana 3306 US-Canada Border May 2016 Trains 0 49.000 -108.388 POINT (-108.387916 48.999527)
In [11]:
border_crossing.nlargest(8, 'Value')
Out[11]:
Port Name State Port Code Border Date Measure Value Latitude Longitude Point
327704 El Paso Texas 2402 US-Mexico Border Mar 2001 Personal Vehicle Passengers 4447374 31.764 -106.451 POINT (-106.451188 31.764363)
323302 El Paso Texas 2402 US-Mexico Border Aug 2000 Personal Vehicle Passengers 4291774 31.764 -106.451 POINT (-106.451188 31.764363)
317938 El Paso Texas 2402 US-Mexico Border Feb 2001 Personal Vehicle Passengers 4281175 31.764 -106.451 POINT (-106.451188 31.764363)
302975 El Paso Texas 2402 US-Mexico Border Apr 2001 Personal Vehicle Passengers 4252128 31.764 -106.451 POINT (-106.451188 31.764363)
322237 El Paso Texas 2402 US-Mexico Border Jun 2000 Personal Vehicle Passengers 4235244 31.764 -106.451 POINT (-106.451188 31.764363)
312964 El Paso Texas 2402 US-Mexico Border Mar 2000 Personal Vehicle Passengers 4157988 31.764 -106.451 POINT (-106.451188 31.764363)
306451 El Paso Texas 2402 US-Mexico Border Sep 2000 Personal Vehicle Passengers 4139166 31.764 -106.451 POINT (-106.451188 31.764363)
321341 El Paso Texas 2402 US-Mexico Border Jan 2001 Personal Vehicle Passengers 4118534 31.764 -106.451 POINT (-106.451188 31.764363)
In [12]:
# Checking dates. These will need to match up with the drug_overdose dataset. Rows will need to be dropped accordingly.

border_max = border_crossing['Date'].max()

border_min = border_crossing['Date'].min()

print(border_max) 
print(border_min)
Sep 2023
Apr 1996
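One caveat: at this point 'Date' is still a plain string column, so min() and max() compare alphabetically, not chronologically — "Apr 1996" sorts before "Jan 1996" even though January is earlier. Parsing first gives the true range. A small sketch with made-up dates:

```python
import pandas as pd

dates = pd.Series(["Sep 2023", "Apr 1996", "Dec 1995"])

# String comparison is alphabetical: "Apr 1996" < "Dec 1995" < "Sep 2023".
print(dates.min(), dates.max())   # Apr 1996 Sep 2023

# Parsing to datetime first gives the real chronological range.
parsed = pd.to_datetime(dates, format="%b %Y")
print(parsed.min(), parsed.max())
```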

Drug Overdose Dataset:

In [13]:
# reading in VSRR_Provisional_Drug_Overdose_Death_Counts.csv and creating a dataframe.

drug_overdose = pd.read_csv('VSRR_Provisional_Drug_Overdose_Death_Counts.csv')

drug_overdose
Out[13]:
State Year Month Period Indicator Data Value Percent Complete Percent Pending Investigation State Name Footnote Footnote Symbol Predicted Value
0 AK 2015 April 12 month-ending Cocaine (T40.5) NaN 100 0.00000 Alaska Numbers may differ from published reports usin... ** NaN
1 AK 2015 April 12 month-ending Natural & semi-synthetic opioids (T40.2) NaN 100 0.00000 Alaska Numbers may differ from published reports usin... ** NaN
2 AK 2015 April 12 month-ending Number of Drug Overdose Deaths 126 100 0.00000 Alaska Numbers may differ from published reports usin... ** 126
3 AK 2015 April 12 month-ending Methadone (T40.3) NaN 100 0.00000 Alaska Numbers may differ from published reports usin... ** NaN
4 AK 2015 April 12 month-ending Synthetic opioids, excl. methadone (T40.4) NaN 100 0.00000 Alaska Numbers may differ from published reports usin... ** NaN
... ... ... ... ... ... ... ... ... ... ... ... ...
63331 YC 2023 May 12 month-ending Cocaine (T40.5) 1,592 100 0.37068 New York City Underreported due to incomplete data. * 1,677
63332 YC 2023 May 12 month-ending Percent with drugs specified 99.77897063 100 0.37068 New York City Underreported due to incomplete data. * NaN
63333 YC 2023 May 12 month-ending Natural & semi-synthetic opioids, incl. methad... 685 100 0.37068 New York City Underreported due to incomplete data. * 714
63334 YC 2023 May 12 month-ending Opioids (T40.0-T40.4,T40.6) 2,693 100 0.37068 New York City Underreported due to incomplete data. * 2,814
63335 YC 2023 May 12 month-ending Synthetic opioids, excl. methadone (T40.4) 2,534 100 0.37068 New York City Underreported due to incomplete data. * 2,661

63336 rows × 12 columns

In [14]:
drug_overdose.sample(8)
Out[14]:
State Year Month Period Indicator Data Value Percent Complete Percent Pending Investigation State Name Footnote Footnote Symbol Predicted Value
2554 AR 2015 January 12 month-ending Opioids (T40.0-T40.4,T40.6) NaN 100 0.049797 Arkansas Numbers may differ from published reports usin... ** NaN
18700 IL 2023 March 12 month-ending Percent with drugs specified 98.5796949 100 0.123061 Illinois Underreported due to incomplete data. * NaN
50866 TX 2015 April 12 month-ending Heroin (T40.1) NaN 100 0.178251 Texas Numbers may differ from published reports usin... ** NaN
15282 IA 2017 August 12 month-ending Heroin (T40.1) 61 100 0.006643 Iowa Numbers may differ from published reports usin... ** 61
27945 MN 2016 December 12 month-ending Number of Drug Overdose Deaths 664 100 0.000000 Minnesota Numbers may differ from published reports usin... ** 664
3870 AZ 2015 October 12 month-ending Natural, semi-synthetic, & synthetic opioids, ... NaN 100 0.249436 Arizona Numbers may differ from published reports usin... ** NaN
16631 ID 2017 November 12 month-ending Psychostimulants with abuse potential (T43.6) NaN 100 0.028703 Idaho Numbers may differ from published reports usin... ** NaN
59963 WV 2017 June 12 month-ending Opioids (T40.0-T40.4,T40.6) 849 100 0.205679 West Virginia Numbers may differ from published reports usin... ** 868
In [15]:
drug_overdose.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63336 entries, 0 to 63335
Data columns (total 12 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   State                          63336 non-null  object 
 1   Year                           63336 non-null  int64  
 2   Month                          63336 non-null  object 
 3   Period                         63336 non-null  object 
 4   Indicator                      63336 non-null  object 
 5   Data Value                     51376 non-null  object 
 6   Percent Complete               63336 non-null  int64  
 7   Percent Pending Investigation  63336 non-null  float64
 8   State Name                     63336 non-null  object 
 9   Footnote                       63336 non-null  object 
 10  Footnote Symbol                63336 non-null  object 
 11  Predicted Value                40655 non-null  object 
dtypes: float64(1), int64(2), object(9)
memory usage: 5.8+ MB
In [16]:
# Checking the years in this dataset. The border_crossing dataset goes back to 1996. If this doesn't go back as far, I will
# isolate the matching years in the border_crossing set.

drug_overdose['Year'].unique()
Out[16]:
array([2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023], dtype=int64)
In [17]:
# Finding null values.

drug_overdose.isnull().sum()
Out[17]:
State                                0
Year                                 0
Month                                0
Period                               0
Indicator                            0
Data Value                       11960
Percent Complete                     0
Percent Pending Investigation        0
State Name                           0
Footnote                             0
Footnote Symbol                      0
Predicted Value                  22681
dtype: int64

Data Cleaning:

border_crossing:

  • Change the date column to datetime.
  • Drop columns Latitude, Longitude, and Point.
  • Drop all rows dated before 2015. We are isolating the span that matches the drug_overdose dataset.
  • Drop all rows where date is above 2023-08-01. This will match the drug_overdose set.
  • Find ports with the highest values in each state. I only want to use ports in each state with the highest values.
  • 'Melt' the 'Measure' values into one value. After trying multiple ways to use individual vehicles in the analysis, I've come to the conclusion that I won't be able to. I'm going to combine individual vehicles per month per state and port into one value.
  • Change the data type for the 'Vehicle Total' from float64 to int64.
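The last bullet — casting the combined total from float64 to an integer dtype — has a wrinkle once missing values are involved: a plain astype('int64') raises on NaN, while pandas' nullable Int64 extension dtype carries the gaps through as <NA>. A sketch on a made-up Series:

```python
import pandas as pd
import numpy as np

# Hypothetical vehicle totals with one missing month.
totals = pd.Series([103.0, np.nan, 57.0])

# astype("int64") would raise on the NaN; the nullable "Int64"
# extension dtype converts NaN to <NA> and keeps the rest as integers.
as_int = totals.astype("Int64")
print(as_int)
```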

drug_overdose:

  • Drop any 'indicator' rows that are not 'Number of Drug Overdose Deaths'.
  • Combine 'Year' and 'Month' to create one datetime column.
  • Figure out what YC is in the state column???
  • Drop the 'State' column. Rename 'State Name' to 'State'.
  • Drop 'Percent Complete', 'Percent Pending Investigation', 'Footnote', 'Footnote Symbol', and 'Predicted Value'.
  • Deal with the nulls in 'Data Value'. I have a feeling this is because of certain rows that I won't be using anyway so once those are gone then I will reevaluate.
  • Drop states that are not in the border_crossing dataset.
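The drug_overdose steps above can be sketched ahead of time on a tiny made-up frame (the column names come from the dataset; the rows and values are illustrative only):

```python
import pandas as pd

# Toy frame mirroring the drug_overdose columns (values are made up).
od = pd.DataFrame({
    "State": ["AK", "AK"],
    "Year": [2015, 2015],
    "Month": ["April", "April"],
    "Indicator": ["Number of Drug Overdose Deaths", "Cocaine (T40.5)"],
    "Data Value": ["126", None],
    "State Name": ["Alaska", "Alaska"],
})

# Keep only the indicator we need, then build one datetime column
# from the separate Year and Month fields.
od = od[od["Indicator"] == "Number of Drug Overdose Deaths"].copy()
od["Date"] = pd.to_datetime(od["Year"].astype(str) + " " + od["Month"],
                            format="%Y %B")

# Drop the abbreviation column and promote 'State Name' to 'State'.
od = od.drop(columns=["State"]).rename(columns={"State Name": "State"})
print(od[["State", "Date", "Data Value"]])
```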

I will be starting with cleaning of the border_crossing dataset and moving on to the drug_overdose after that. This will ensure that things stay nice and tidy.

border_crossing cleaning:

In [18]:
# Change the date column to datetime. 

border_crossing['Date'] = pd.to_datetime(border_crossing['Date'])

border_crossing.sample(5)
Out[18]:
Port Name State Port Code Border Date Measure Value Latitude Longitude Point
327326 International Falls Minnesota 3604 US-Canada Border 2000-10-01 Truck Containers Loaded 2802 48.608 -93.401 POINT (-93.401355 48.6078)
356871 Naco Arizona 2603 US-Mexico Border 1998-08-01 Personal Vehicles 26582 31.334 -109.948 POINT (-109.948413 31.334084)
187327 Dunseith North Dakota 3422 US-Canada Border 2008-06-01 Trains 0 48.999 -100.052 POINT (-100.052166 48.999277)
27942 Calais Maine 115 US-Canada Border 2021-02-01 Rail Containers Empty 146 45.189 -67.275 POINT (-67.275381 45.188548)
240506 Laredo Texas 2304 US-Mexico Border 2005-10-01 Rail Containers Empty 9725 27.500 -99.507 POINT (-99.507412 27.499561)
In [19]:
# Making sure the 'Date' column was changed to datetime.

print(border_crossing.dtypes)
Port Name            object
State                object
Port Code             int64
Border               object
Date         datetime64[ns]
Measure              object
Value                 int64
Latitude            float64
Longitude           float64
Point                object
dtype: object
In [20]:
# Dropping the Port Code, Latitude, Longitude, and Point columns as they are not needed. Using inplace=True because I just
# want to modify the original dataframe rather than create a new one.

border_crossing.drop(columns=['Port Code', 'Latitude', 'Longitude', 'Point'], inplace=True)

border_crossing.sample(5)
Out[20]:
Port Name State Border Date Measure Value
212584 Metaline Falls Washington US-Canada Border 2006-11-01 Trucks 656
203566 Laredo Texas US-Mexico Border 2008-01-01 Pedestrians 351470
904 Detroit Michigan US-Canada Border 2023-11-01 Train Passengers 316
375203 Whitetail Montana US-Canada Border 1996-01-01 Truck Containers Empty 0
156321 Grand Portage Minnesota US-Canada Border 2009-08-01 Truck Containers Empty 565
In [21]:
# Dropping all rows where year in 'Date' is below 2015.

border_crossing = border_crossing[border_crossing['Date'].dt.year >= 2015]

border_crossing.sample(5)
Out[21]:
Port Name State Border Date Measure Value
12264 Laredo Texas US-Mexico Border 2022-12-01 Bus Passengers 57176
41009 Morgan Montana US-Canada Border 2018-10-01 Truck Containers Empty 14
86615 Laurier Washington US-Canada Border 2016-01-01 Truck Containers Loaded 604
90947 Calexico East California US-Mexico Border 2015-11-01 Truck Containers Loaded 15568
50374 Norton Vermont US-Canada Border 2020-04-01 Personal Vehicle Passengers 147
In [22]:
# Checking to make sure there are no years below 2015 left.

checking_years = border_crossing['Date'].dt.year.unique()

print(checking_years)
[2023 2022 2021 2020 2019 2018 2017 2016 2015]
In [23]:
# Checking dates. These will need to match with the drug_overdose dataset. Will drop rows accordingly.

border_min = border_crossing['Date'].min()

border_max = border_crossing['Date'].max()

print(border_min)
print(border_max)
2015-01-01 00:00:00
2023-12-01 00:00:00
In [24]:
# Filtering and dropping any rows with dates above 2023-08-01 as the drug_overdose dataset only goes to 2023-08-01.

border_dates = border_crossing['Date'] <= '2023-08-01'

border_filtered = border_crossing[border_dates]

border_filtered.sample(5)
Out[24]:
Port Name State Border Date Measure Value
20710 International Falls Minnesota US-Canada Border 2022-07-01 Personal Vehicles 24162
58521 Van Buren Maine US-Canada Border 2017-08-01 Pedestrians 21
68578 El Paso Texas US-Mexico Border 2016-07-01 Rail Containers Loaded 4467
38730 Buffalo Niagara Falls New York US-Canada Border 2017-05-01 Trains 156
4849 Eagle Pass Texas US-Mexico Border 2020-10-01 Pedestrians 41944
In [25]:
# Making sure the correct dates were dropped. Everything above 2023-08-01.

border_min2 = border_filtered['Date'].min()

border_max2 = border_filtered['Date'].max()

print(border_min2)
print(border_max2)
2015-01-01 00:00:00
2023-08-01 00:00:00
In [26]:
border_filtered['Measure'].unique()
Out[26]:
array(['Pedestrians', 'Buses', 'Trucks', 'Bus Passengers', 'Trains',
       'Truck Containers Empty', 'Rail Containers Empty',
       'Train Passengers', 'Truck Containers Loaded', 'Personal Vehicles',
       'Personal Vehicle Passengers', 'Rail Containers Loaded'],
      dtype=object)
In [27]:
border_filtered['State'].unique()
Out[27]:
array(['Maine', 'Texas', 'Washington', 'North Dakota', 'Arizona',
       'Vermont', 'Montana', 'Alaska', 'New York', 'Michigan',
       'Minnesota', 'New Mexico', 'California', 'Idaho'], dtype=object)

Finding the port in each state with the highest values.

These are the ports I will be using going forward; keeping multiple ports per state and multiple vehicle types made the analysis unworkable.

In [28]:
# For each port, locating the row with its highest 'Value', then sorting the result by 'State'.

high_try = border_filtered.loc[border_filtered.groupby('Port Name')['Value'].idxmax()].sort_values(by='State')

# I only want to see these columns.

high_try_columns = high_try[['Port Name', 'State', 'Value']]

# I need to see all of the rows. This just tells pandas to display all the rows here.

pd.set_option('display.max_rows', None)

# Print it.

print(high_try_columns)
                    Port Name         State    Value
47906                   Alcan        Alaska    33124
34627            Dalton Cache        Alaska     8443
90329               Ketchikan        Alaska     1095
40086                 Skagway        Alaska    54120
45724               Lukeville       Arizona   133395
47621                 Nogales       Arizona   689371
35680                San Luis       Arizona   535446
35946                    Naco       Arizona    54902
35600                  Sasabe       Arizona     5918
50809                 Douglas       Arizona   282951
70672           Calexico East    California   678287
10599     Cross Border Xpress    California   254418
47029              San Ysidro    California  2470347
36984                  Tecate    California   193536
58056                Calexico    California   826331
42351                 Andrade    California   130116
57837               Otay Mesa    California  1252553
48792                Eastport         Idaho    50839
63177                Porthill         Idaho    25341
89649               Limestone         Maine     3796
68984               Van Buren         Maine    28212
47060                Portland         Maine     8694
58491                 Jackman         Maine    60239
85887                 Houlton         Maine    62823
89388               Fort Kent         Maine    23090
72684          Fort Fairfield         Maine    20059
16885               Vanceboro         Maine     8188
72373               Madawaska         Maine    61181
61524             Bridgewater         Maine     8898
10368              Bar Harbor         Maine     6263
53153                  Calais         Maine   150385
51168                 Detroit      Michigan   721390
76687              Port Huron      Michigan   389663
64565      Sault Sainte Marie      Michigan   159314
24612                 Algonac      Michigan    13082
36895     International Falls     Minnesota    93987
78077               Lancaster     Minnesota    10350
71081               Pinecreek     Minnesota     1098
50831                Baudette     Minnesota    34712
67458                  Roseau     Minnesota     9104
52694                 Warroad     Minnesota    22936
66635           Grand Portage     Minnesota    57345
15709            Willow Creek       Montana     1178
73531                  Turner       Montana     2645
46044                  Piegan       Montana    68177
78060              Sweetgrass       Montana    57825
37218                  Opheim       Montana     1002
90949                  Morgan       Montana     1909
82266              Del Bonita       Montana    10239
30095                  Scobey       Montana     1546
63494                 Raymond       Montana     6932
59674               Roosville       Montana    52814
54416               Wildhorse       Montana     4971
63588                Whitlash       Montana      190
61030                Columbus    New Mexico    78551
27433            Santa Teresa    New Mexico   159044
64723             Trout River      New York    37870
76144                 Massena      New York   146959
47689            Cape Vincent      New York     6244
66143  Champlain Rouses Point      New York   368761
41574   Buffalo Niagara Falls      New York  1453934
65986          Alexandria Bay      New York   198917
84851              Ogdensburg      New York    68684
91455                 St John  North Dakota     3392
82644                Sherwood  North Dakota     3644
41056                 Pembina  North Dakota    91644
54643                  Portal  North Dakota    25943
84064                  Antler  North Dakota     1899
70963               Northgate  North Dakota     4091
51396                 Ambrose  North Dakota      348
67992                Walhalla  North Dakota     9454
53578                 Carbury  North Dakota     2765
42500                Westhope  North Dakota     2293
66226                Hansboro  North Dakota     1908
66594                   Maida  North Dakota     2295
45512                 Fortuna  North Dakota     4056
87164                   Neche  North Dakota     7640
67217                Dunseith  North Dakota    24019
75200                  Noonan  North Dakota     6230
59783                  Sarles  North Dakota     1077
81032                  Hannah  North Dakota      256
40015                 El Paso         Texas  1951761
57753              Eagle Pass         Texas   555481
46736                    Roma         Texas   146951
34737         Rio Grande City         Texas    85839
78786                 Hidalgo         Texas   954371
47369                Presidio         Texas   146085
66417                  Laredo         Texas  1104774
48697             Brownsville         Texas   910576
59615                Tornillo         Texas    87927
53309               Boquillas         Texas     4362
49637                 Del Rio         Texas   292251
3997                 Progreso         Texas   277680
26256                  Ysleta         Texas   549464
55017              Derby Line       Vermont   157491
44295           Beecher Falls       Vermont    24549
37280        Highgate Springs       Vermont   171263
86330                  Norton       Vermont    12018
66543                Richford       Vermont    23179
84436                Danville    Washington     6662
44284                   Ferry    Washington     3526
43988           Friday Harbor    Washington     1652
33554                Frontier    Washington    10276
44520          Kenneth G Ward    Washington   141003
79726                 Laurier    Washington    11938
57401          Metaline Falls    Washington     9816
38838                Boundary    Washington     6996
35970           Point Roberts    Washington   157913
43256                   Sumas    Washington   179464
48643            Port Angeles    Washington    28142
76025               Anacortes    Washington    13638
67174                  Blaine    Washington   885213
47780               Nighthawk    Washington     1990
47254                Oroville    Washington    53717

Ports with the highest values.

Ports that are not listed here will be filtered out and dropped.

  1. Alaska / Skagway
  2. Arizona / Nogales
  3. California / San Ysidro
  4. Idaho / Eastport
  5. Maine / Calais
  6. Michigan / Detroit
  7. Minnesota / International Falls
  8. Montana / Piegan
  9. New Mexico / Santa Teresa
  10. New York / Buffalo Niagara Falls
  11. North Dakota / Pembina
  12. Texas / El Paso
  13. Vermont / Highgate Springs
  14. Washington / Blaine
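As a cross-check, the same list could be derived programmatically rather than read off the printed table — a sketch on a made-up frame, assuming "highest value" means the largest single monthly 'Value' a port ever recorded:

```python
import pandas as pd

# Toy frame standing in for border_filtered (values are made up).
df = pd.DataFrame({
    "State": ["Alaska", "Alaska", "Maine", "Maine"],
    "Port Name": ["Skagway", "Alcan", "Calais", "Houlton"],
    "Value": [54120, 33124, 150385, 62823],
})

# Highest monthly Value per (State, Port), then the top port per state.
per_port = df.groupby(["State", "Port Name"])["Value"].max().reset_index()
top_ports = per_port.loc[per_port.groupby("State")["Value"].idxmax(),
                         "Port Name"].tolist()
print(top_ports)
```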
In [29]:
# Filtering border_filtered down to the ports listed above, since it is easier to keep these 14 ports than to drop over 100 others.

ports = ['Skagway', 'Nogales', 'San Ysidro', 'Eastport', 'Calais', 'Detroit', 'International Falls', 'Piegan', 'Santa Teresa',
            'Buffalo Niagara Falls', 'Pembina', 'El Paso', 'Highgate Springs', 'Blaine']


border_final = border_filtered[border_filtered['Port Name'].isin(ports)]

# Checking to make sure the only ports are the ones listed above.

border_final['Port Name'].unique()
Out[29]:
array(['Detroit', 'Calais', 'Nogales', 'El Paso', 'Pembina', 'Skagway',
       'Highgate Springs', 'Piegan', 'Eastport', 'International Falls',
       'San Ysidro', 'Blaine', 'Buffalo Niagara Falls', 'Santa Teresa'],
      dtype=object)

Using pivot_table() to reshape the dataframe so that I can combine all of the vehicle columns into one.

In [30]:
border_pivot = pd.pivot_table(border_final, values='Value', columns='Measure', index=['State', 'Date'], aggfunc="sum")

border_pivot.sample(5)
Out[30]:
Measure Bus Passengers Buses Pedestrians Personal Vehicle Passengers Personal Vehicles Rail Containers Empty Rail Containers Loaded Train Passengers Trains Truck Containers Empty Truck Containers Loaded Trucks
State Date
Washington 2023-06-01 12188.0 714.0 5797.0 510224.0 264182.0 11488.0 7169.0 14784.0 165.0 11251.0 21044.0 32097.0
Alaska 2020-05-01 NaN NaN NaN 279.0 170.0 NaN NaN NaN NaN 127.0 1.0 128.0
Michigan 2015-03-01 16384.0 1528.0 0.0 572248.0 334342.0 5562.0 8794.0 546.0 182.0 32024.0 139753.0 128818.0
Arizona 2018-11-01 15579.0 755.0 273033.0 573495.0 289059.0 3707.0 2788.0 272.0 68.0 5125.0 22307.0 26777.0
Texas 2023-06-01 17054.0 854.0 335257.0 1150849.0 678657.0 6881.0 2330.0 NaN 107.0 6672.0 4559.0 11199.0

Why I replaced NaN values instead of imputing:

Originally, when I used pivot_table() to reshape the dataframe and add all the vehicle columns, many NaN values appeared. I didn't understand why until I started locating rows by 'State' and 'Date'. For some states, every vehicle type is recorded with an associated value; other states don't have certain vehicle types recorded at all.

I decided against imputing because I didn't want kNN-generated data influencing the analytics and models later. Instead, I replaced NaN with pd.NA, which signifies that the data is genuinely missing after the pivot-table reshape.

Examples below.

In [31]:
# Example showing how Michigan has all vehicle types.

row_check = border_final.loc[(border_final['State'] == 'Michigan') & (border_final['Date'] == '2023-07-01')]

print(row_check)
      Port Name     State            Border       Date  \
2450    Detroit  Michigan  US-Canada Border 2023-07-01   
6951    Detroit  Michigan  US-Canada Border 2023-07-01   
18265   Detroit  Michigan  US-Canada Border 2023-07-01   
18815   Detroit  Michigan  US-Canada Border 2023-07-01   
19970   Detroit  Michigan  US-Canada Border 2023-07-01   
20218   Detroit  Michigan  US-Canada Border 2023-07-01   
20671   Detroit  Michigan  US-Canada Border 2023-07-01   
24332   Detroit  Michigan  US-Canada Border 2023-07-01   
27220   Detroit  Michigan  US-Canada Border 2023-07-01   
30191   Detroit  Michigan  US-Canada Border 2023-07-01   

                           Measure   Value  
2450        Rail Containers Loaded    6995  
6951                         Buses    1141  
18265  Personal Vehicle Passengers  648638  
18815            Personal Vehicles  324073  
19970                       Trucks  135359  
20218                       Trains     129  
20671      Truck Containers Loaded  193823  
24332               Bus Passengers    6511  
27220       Truck Containers Empty   30622  
30191        Rail Containers Empty    3568  
In [32]:
# Example showing that California does not have all vehicle types.

row_check = border_final.loc[(border_final['State'] == 'California') & (border_final['Date'] == '2023-07-01')]

print(row_check)
        Port Name       State            Border       Date  \
6706   San Ysidro  California  US-Mexico Border 2023-07-01   
7971   San Ysidro  California  US-Mexico Border 2023-07-01   
19634  San Ysidro  California  US-Mexico Border 2023-07-01   
24377  San Ysidro  California  US-Mexico Border 2023-07-01   
25698  San Ysidro  California  US-Mexico Border 2023-07-01   

                           Measure    Value  
6706                         Buses     3442  
7971                Bus Passengers    41571  
19634            Personal Vehicles  1369823  
24377  Personal Vehicle Passengers  2302112  
25698                  Pedestrians   632785  
In [33]:
border_pivot.replace(np.nan, pd.NA, inplace=True)

border_pivot.sample(5)
Out[33]:
Measure Bus Passengers Buses Pedestrians Personal Vehicle Passengers Personal Vehicles Rail Containers Empty Rail Containers Loaded Train Passengers Trains Truck Containers Empty Truck Containers Loaded Trucks
State Date
Minnesota 2015-03-01 358.0 16.0 716.0 45459.0 32059.0 14018.0 50630.0 714.0 357.0 103.0 1486.0 1564.0
2019-09-01 472.0 21.0 816.0 66093.0 33817.0 14567.0 58754.0 712.0 354.0 361.0 888.0 1246.0
Maine 2017-05-01 308.0 8.0 885.0 103498.0 75174.0 26.0 49.0 24.0 20.0 3584.0 7810.0 5748.0
New Mexico 2017-08-01 233.0 25.0 12905.0 118579.0 51823.0 <NA> <NA> <NA> <NA> 4208.0 5886.0 10011.0
2020-07-01 <NA> <NA> 183.0 57743.0 32607.0 <NA> <NA> <NA> <NA> 5587.0 7654.0 13395.0
In [34]:
# Creating a new column, 'Vehicle Total', by adding all of the vehicle type values into one.

border_pivot['Vehicle Total'] = border_pivot.sum(axis=1)

border_pivot.sample(5)
Out[34]:
Measure Bus Passengers Buses Pedestrians Personal Vehicle Passengers Personal Vehicles Rail Containers Empty Rail Containers Loaded Train Passengers Trains Truck Containers Empty Truck Containers Loaded Trucks Vehicle Total
State Date
Washington 2018-12-01 16550.0 877.0 <NA> 637811.0 323186.0 8661.0 9233.0 4475.0 224.0 8223.0 19686.0 27497.0 1056423.0
Maine 2015-05-01 844.0 23.0 1469.0 127227.0 81021.0 16.0 45.0 22.0 11.0 1754.0 3718.0 5917.0 222067.0
2016-12-01 371.0 11.0 218.0 97828.0 64179.0 21.0 61.0 30.0 15.0 1100.0 3060.0 4917.0 171811.0
Arizona 2022-03-01 40908.0 607.0 209087.0 587941.0 312092.0 3467.0 4885.0 656.0 81.0 7111.0 29348.0 36571.0 1232754.0
North Dakota 2016-02-01 2336.0 68.0 0.0 27666.0 14610.0 2767.0 8010.0 190.0 95.0 2750.0 14851.0 17720.0 91063.0
In [35]:
border_pivot.drop(columns=['Bus Passengers', 'Buses', 'Pedestrians', 'Personal Vehicle Passengers', 'Personal Vehicles',
                              'Rail Containers Empty', 'Rail Containers Loaded', 'Train Passengers', 'Truck Containers Empty',
                                  'Truck Containers Loaded', 'Trucks', 'Trains'], inplace=True)

border_pivot.sample(5)
Out[35]:
Measure Vehicle Total
State Date
Arizona 2015-01-01 1190799.0
Montana 2018-12-01 14944.0
Maine 2020-10-01 34377.0
Alaska 2021-07-01 2911.0
Arizona 2019-04-01 1208020.0
In [36]:
border_pivot.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1456 entries, ('Alaska', Timestamp('2015-01-01 00:00:00')) to ('Washington', Timestamp('2023-08-01 00:00:00'))
Data columns (total 1 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Vehicle Total  1456 non-null   float64
dtypes: float64(1)
memory usage: 15.2+ KB
In [37]:
# Changing the data type to int64.

border_pivot['Vehicle Total'] = border_pivot['Vehicle Total'].astype('int64')

border_pivot.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1456 entries, ('Alaska', Timestamp('2015-01-01 00:00:00')) to ('Washington', Timestamp('2023-08-01 00:00:00'))
Data columns (total 1 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Vehicle Total  1456 non-null   int64
dtypes: int64(1)
memory usage: 15.2+ KB

drug_overdose cleaning:

In [38]:
# Dropping the 'State' column. 

drug_overdose.drop(columns=['State'], inplace=True)

drug_overdose.sample(5)
Out[38]:
Year Month Period Indicator Data Value Percent Complete Percent Pending Investigation State Name Footnote Footnote Symbol Predicted Value
1744 2018 July 12 month-ending Number of Deaths 53,478 100 0.207562 Alabama Numbers may differ from published reports usin... ** NaN
20910 2021 June 12 month-ending Methadone (T40.3) NaN 100 0.090949 Kansas Numbers may differ from published reports usin... ** NaN
9717 2021 November 12 month-ending Psychostimulants with abuse potential (T43.6) 12 100 0.013822 District of Columbia Numbers may differ from published reports usin... ** 13
54885 2016 September 12 month-ending Number of Deaths 64,726 100 0.012360 Virginia Numbers may differ from published reports usin... ** NaN
22161 2021 June 12 month-ending Opioids (T40.0-T40.4,T40.6) 1,837 100 0.037698 Kentucky Numbers may differ from published reports usin... ** 1,840
In [39]:
# Renaming 'State Name' to 'State'.

drug_overdose = drug_overdose.rename(columns={'State Name':'State'})

drug_overdose.sample(5)
Out[39]:
Year Month Period Indicator Data Value Percent Complete Percent Pending Investigation State Footnote Footnote Symbol Predicted Value
12884 2017 November 12 month-ending Natural & semi-synthetic opioids (T40.2) 504 100 0.087253 Georgia Numbers may differ from published reports usin... ** 509
26987 2018 December 12 month-ending Number of Deaths 97,545 100 0.104567 Michigan Numbers may differ from published reports usin... ** NaN
38912 2020 October 12 month-ending Psychostimulants with abuse potential (T43.6) 328 100 0.181229 New Mexico Numbers may differ from published reports usin... ** 336
62089 2015 April 12 month-ending Natural & semi-synthetic opioids (T40.2) 176 100 0.020379 New York City Numbers may differ from published reports usin... ** 177
3792 2015 January 12 month-ending Cocaine (T40.5) NaN 100 0.326697 Arizona Numbers may differ from published reports usin... ** NaN

Combining the 'Year' and 'Month' columns into one column called 'Date'. This requires concatenating the two as strings with a hyphen and then converting to datetime. Before that can happen, the 'Year' column has to be converted from int to string (object).

In [40]:
# Converting 'Year' column to an object.

drug_overdose['Year'] = drug_overdose['Year'].astype(str)

print(drug_overdose.dtypes)
Year                              object
Month                             object
Period                            object
Indicator                         object
Data Value                        object
Percent Complete                   int64
Percent Pending Investigation    float64
State                             object
Footnote                          object
Footnote Symbol                   object
Predicted Value                   object
dtype: object
In [41]:
drug_overdose.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63336 entries, 0 to 63335
Data columns (total 11 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Year                           63336 non-null  object 
 1   Month                          63336 non-null  object 
 2   Period                         63336 non-null  object 
 3   Indicator                      63336 non-null  object 
 4   Data Value                     51376 non-null  object 
 5   Percent Complete               63336 non-null  int64  
 6   Percent Pending Investigation  63336 non-null  float64
 7   State                          63336 non-null  object 
 8   Footnote                       63336 non-null  object 
 9   Footnote Symbol                63336 non-null  object 
 10  Predicted Value                40655 non-null  object 
dtypes: float64(1), int64(1), object(9)
memory usage: 5.3+ MB
In [42]:
# Combining the 'Year' and 'Month' with a hyphen.

drug_overdose['Date'] = drug_overdose['Year'].astype(str) + '-' + drug_overdose['Month'].astype(str)

drug_overdose.sample(5)
Out[42]:
Year Month Period Indicator Data Value Percent Complete Percent Pending Investigation State Footnote Footnote Symbol Predicted Value Date
20592 2019 January 12 month-ending Cocaine (T40.5) NaN 100 0.011416 Kansas Numbers may differ from published reports usin... ** NaN 2019-January
62031 2023 January 12 month-ending Number of Drug Overdose Deaths 123 100 0.037251 Wyoming Underreported due to incomplete data. * 123 2023-January
17387 2023 April 12 month-ending Synthetic opioids, excl. methadone (T40.4) 197 100 0.203568 Idaho Underreported due to incomplete data. * 203 2023-April
36910 2015 March 12 month-ending Opioids (T40.0-T40.4,T40.6) NaN 100 0.243882 New Jersey Numbers may differ from published reports usin... ** NaN 2015-March
43322 2016 October 12 month-ending Heroin (T40.1) 52 100 0.015771 Oklahoma Numbers may differ from published reports usin... ** 52 2016-October
In [43]:
# Converting 'Date' over to datetime.

drug_overdose['Date'] = pd.to_datetime(drug_overdose['Date'])

drug_overdose.sample(5)
Out[43]:
Year Month Period Indicator Data Value Percent Complete Percent Pending Investigation State Footnote Footnote Symbol Predicted Value Date
51605 2020 December 12 month-ending Synthetic opioids, excl. methadone (T40.4) 1,012 100 0.130469 Texas Numbers may differ from published reports usin... ** 1,028 2020-12-01
6263 2015 August 12 month-ending Opioids (T40.0-T40.4,T40.6) NaN 100 0.037710 Colorado Numbers may differ from published reports usin... ** NaN 2015-08-01
8053 2018 September 12 month-ending Natural, semi-synthetic, & synthetic opioids, ... 890 100 0.025308 Connecticut Numbers may differ from published reports usin... ** 891 2018-09-01
47845 2020 August 12 month-ending Heroin (T40.1) 247 100 0.064442 South Carolina Numbers may differ from published reports usin... ** 249 2020-08-01
58667 2017 December 12 month-ending Number of Deaths 52,591 100 0.171132 Wisconsin Numbers may differ from published reports usin... ** NaN 2017-12-01
In [44]:
# Dropping 'Period', 'Percent Complete', 'Percent Pending Investigation', 'Footnote', 'Footnote Symbol', 'Predicted Value',
# 'Year' and 'Month' as I won't be needing them anymore.

drug_overdose.drop(columns=['Period', 'Percent Complete', 'Percent Pending Investigation', 'Footnote', 'Footnote Symbol', 'Predicted Value', 'Year', 'Month'], inplace= True)

drug_overdose.sample(5)
Out[44]:
Indicator Data Value State Date
49933 Natural & semi-synthetic opioids (T40.2) 677 Tennessee 2017-02-01
45420 Natural & semi-synthetic opioids (T40.2) 131 Oregon 2022-11-01
48559 Synthetic opioids, excl. methadone (T40.4) NaN South Dakota 2016-01-01
27275 Heroin (T40.1) 454 Michigan 2020-12-01
33672 Methadone (T40.3) 103 North Carolina 2021-01-01
In [45]:
# Finding the unique values within 'Indicator'; I will drop every category except 'Number of Drug Overdose Deaths'.

drug_overdose['Indicator'].unique()
Out[45]:
array(['Cocaine (T40.5)', 'Natural & semi-synthetic opioids (T40.2)',
       'Number of Drug Overdose Deaths', 'Methadone (T40.3)',
       'Synthetic opioids, excl. methadone (T40.4)', 'Number of Deaths',
       'Psychostimulants with abuse potential (T43.6)',
       'Natural, semi-synthetic, & synthetic opioids, incl. methadone (T40.2-T40.4)',
       'Percent with drugs specified',
       'Natural & semi-synthetic opioids, incl. methadone (T40.2, T40.3)',
       'Heroin (T40.1)', 'Opioids (T40.0-T40.4,T40.6)'], dtype=object)
In [46]:
# Filtering out rows whose 'Indicator' is not 'Number of Drug Overdose Deaths'.

drug_filter = drug_overdose['Indicator'] == 'Number of Drug Overdose Deaths'

drug_overdose = drug_overdose[drug_filter]

drug_overdose.sample(5)
Out[46]:
Indicator Data Value State Date
1808 Number of Drug Overdose Deaths 751 Alabama 2018-10-01
47334 Number of Drug Overdose Deaths 804 South Carolina 2016-06-01
33320 Number of Drug Overdose Deaths 2,245 North Carolina 2018-10-01
20041 Number of Drug Overdose Deaths 330 Kansas 2015-06-01
57704 Number of Drug Overdose Deaths 1,251 Washington 2019-12-01
In [47]:
# Making sure that the null values have been dealt with. My assumption was correct: they were in categories I didn't need.

drug_overdose.isnull().sum()
Out[47]:
Indicator     0
Data Value    0
State         0
Date          0
dtype: int64
In [48]:
# Checking dates. These will need to match with the border_crossing dataset. Will drop rows accordingly.

drug_max = drug_overdose['Date'].max()

drug_min = drug_overdose['Date'].min()

print(drug_max)
print(drug_min)
2023-08-01 00:00:00
2015-01-01 00:00:00

Next, dropping the states that are not border states, i.e. dropping rows based on their 'State' value. The border states to keep are: Maine, Texas, Washington, North Dakota, Arizona, Vermont, Montana, Alaska, New York, Michigan, Minnesota, New Mexico, California, and Idaho.

In [49]:
drop_states = [ 'Alabama', 'Arkansas', 'Colorado', 'Connecticut', 'Delaware', 'District of Columbia', 'Florida', 'Georgia',
                'Hawaii', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
                'Maryland', 'Massachusetts', 'Mississippi', 'Missouri', 'Nebraska', 'Nevada',
                'New Hampshire', 'New Jersey', 'New York City', 'North Carolina', 'Ohio', 'Oklahoma', 'Oregon',
                'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee',
                'Utah', 'Virginia', 'West Virginia', 'Wisconsin','Wyoming', 'United States']

drug_overdose = drug_overdose[~drug_overdose['State'].isin(drop_states)]

drug_overdose['State'].unique()
Out[49]:
array(['Alaska', 'Arizona', 'California', 'Idaho', 'Maine', 'Michigan',
       'Minnesota', 'Montana', 'North Dakota', 'New Mexico', 'New York',
       'Texas', 'Vermont', 'Washington'], dtype=object)
In [50]:
# Making sure that the states listed above match the border_filtered dataset.

border_sorted = sorted(border_filtered['State'].unique())

border_sorted
Out[50]:
['Alaska',
 'Arizona',
 'California',
 'Idaho',
 'Maine',
 'Michigan',
 'Minnesota',
 'Montana',
 'New Mexico',
 'New York',
 'North Dakota',
 'Texas',
 'Vermont',
 'Washington']
In [51]:
# The 'Indicator' column is no longer needed: after filtering it holds a single value, so 'Data Value' alone carries the information.

drug_overdose.drop(columns=['Indicator'], inplace= True)

drug_overdose.sample(5)
Out[51]:
Data Value State Date
34842 114 North Dakota 2020-11-01
56600 186 Vermont 2020-12-01
32346 153 Montana 2020-11-01
31753 123 Montana 2016-05-01
32567 190 Montana 2022-02-01
In [52]:
# Renaming 'Data Value' to 'Drug Overdose Deaths'. This keeps things tidy when merging the datasets.

drug_overdose = drug_overdose.rename(columns={'Data Value': 'Drug Overdose Deaths'})

drug_overdose.sample(5)
Out[52]:
Drug Overdose Deaths State Date
28675 1,139 Minnesota 2021-02-01
27739 2,983 Michigan 2023-06-01
57469 1,120 Washington 2017-03-01
16947 250 Idaho 2020-04-01
31933 119 Montana 2017-09-01
In [53]:
# Checking to make sure data types are correct. I will be changing 'Drug Overdose Deaths' column to int64.

drug_overdose.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1456 entries, 2 to 58334
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   Drug Overdose Deaths  1456 non-null   object        
 1   State                 1456 non-null   object        
 2   Date                  1456 non-null   datetime64[ns]
dtypes: datetime64[ns](1), object(2)
memory usage: 45.5+ KB
In [54]:
# I keep running into an error when trying to change the 'Drug Overdose Deaths' column into int64. The error has
# to do with there being commas in the values. Now I'm curious how many there actually are.


comma_count = drug_overdose['Drug Overdose Deaths'].str.contains(',').sum()

print(comma_count)
673
In [55]:
# Now that I know there are a lot of commas, I'm going to strip them out; they are blocking the conversion of
# the 'Drug Overdose Deaths' column to int64.

drug_overdose['Drug Overdose Deaths'] = drug_overdose['Drug Overdose Deaths'].str.replace(',', '')

drug_overdose['Drug Overdose Deaths'] = drug_overdose['Drug Overdose Deaths'].astype('int64')

drug_overdose.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1456 entries, 2 to 58334
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   Drug Overdose Deaths  1456 non-null   int64         
 1   State                 1456 non-null   object        
 2   Date                  1456 non-null   datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 45.5+ KB

Combining the datasets together:

In [56]:
final_border = pd.merge(border_pivot, drug_overdose, on=['State', 'Date'])

final_border.sample(10)
Out[56]:
State Date Vehicle Total Drug Overdose Deaths
718 Minnesota 2022-11-01 96796 1379
457 Maine 2018-06-01 218552 404
555 Michigan 2017-12-01 1135349 2665
1188 Texas 2018-09-01 3565659 3035
1226 Texas 2021-11-01 2239830 4916
576 Michigan 2019-09-01 1205944 2338
799 Montana 2020-12-01 906 157
411 Idaho 2023-04-01 33862 368
762 Montana 2017-11-01 9865 115
1132 North Dakota 2022-09-01 104881 144
In [57]:
# Making sure that merge() worked and there are no NaN values. Success!

final_border.isna().sum()
Out[57]:
State                   0
Date                    0
Vehicle Total           0
Drug Overdose Deaths    0
dtype: int64
In [58]:
final_border.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1456 entries, 0 to 1455
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   State                 1456 non-null   object        
 1   Date                  1456 non-null   datetime64[ns]
 2   Vehicle Total         1456 non-null   int64         
 3   Drug Overdose Deaths  1456 non-null   int64         
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 56.9+ KB
In [59]:
final_check = final_border.loc[(final_border['State'] == 'Texas') & (final_border['Date'] == '2018-07-01')]

print(final_check)
      State       Date  Vehicle Total  Drug Overdose Deaths
1186  Texas 2018-07-01        3763665                  2975

1. Visualizing descriptive statistics

Overview of the distribution of drug overdose deaths.

For the descriptive statistics part of this project, I wanted to understand the spread of drug overdose deaths and vehicle frequency in states along the U.S.-Mexico and U.S.-Canada borders. They are broken up into drug overdoses and vehicle frequency separately, and then shown together. I started by using describe() on the 'Drug Overdose Deaths' column, which gives a basic overview of summary statistics. The average number of drug overdose deaths per state per month is about 1,599 and the max is 12,147. Below are more visualizations that help show what drug overdose deaths in border states look like.

In [60]:
final_border['Drug Overdose Deaths'].describe()
Out[60]:
count     1456.000000
mean      1599.187500
std       2108.160212
min         59.000000
25%        209.000000
50%        690.000000
75%       2467.500000
max      12147.000000
Name: Drug Overdose Deaths, dtype: float64

The graph below is a kernel density plot via seaborn. Originally I used pandas' own plot() with extra arguments such as 'kind', 'xlim', and 'xticks'. The two graphs were very similar, but seaborn's kdeplot() worked much better and eliminated the need for the extra elements that plot() requires. I can see that drug overdose deaths are concentrated below 1,000 per month and steadily taper off above that, which matches the statistics above. There are also outliers around the 12,000 mark.

A kernel density plot can be thought of as a much smoother version of a histogram. Instead of putting data into bins, it uses a smoothing function under the hood to visualize the density distribution.
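As a sketch of the smoothing idea (synthetic data; seaborn has its own KDE internals, but scipy's gaussian_kde illustrates the same mechanism):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = rng.normal(loc=1000, scale=200, size=500)  # hypothetical monthly counts

# A Gaussian kernel is centered at every observation and the kernels are summed.
kde = gaussian_kde(sample)

# Evaluating the smooth density on a grid is essentially what kdeplot draws.
grid = np.linspace(sample.min(), sample.max(), 200)
density = kde(grid)

# The curve is non-negative and its area over the support is close to 1.
area = density.sum() * (grid[1] - grid[0])
print(round(area, 2))
```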

In [61]:
DrugOD_column = final_border['Drug Overdose Deaths']

plt.figure(figsize=(17,8))

sns.kdeplot(DrugOD_column)
Out[61]:
<Axes: xlabel='Drug Overdose Deaths', ylabel='Density'>

The graph below visualizes drug overdose deaths over time, 2015 through mid-2023. I used seaborn's lineplot() to show the mean and a confidence interval over time. The dark line represents the mean monthly drug overdose deaths; the shaded region around it is a 95% confidence interval, i.e. the range of plausible values for the true mean. We can see an uptick in drug overdose deaths during and after 2020.

The issue with this graph is that it doesn't represent the full spread of drug overdose deaths over time. The kernel density plot above shows that while counts above 2,500 per month were rare, they did occur. To capture that, I used a box plot next.
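Seaborn's shaded band can be approximated by hand: group, take the mean, and build an interval around it. A minimal sketch on made-up yearly data, using a normal approximation (mean ± 1.96 × standard error) rather than seaborn's default bootstrap:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    'Year': np.repeat([2019, 2020], 50),
    'Deaths': np.concatenate([rng.normal(700, 100, 50),
                              rng.normal(900, 120, 50)]),
})

# Mean and standard error per year, then an approximate 95% interval.
stats = df.groupby('Year')['Deaths'].agg(['mean', 'sem'])
stats['ci_low'] = stats['mean'] - 1.96 * stats['sem']
stats['ci_high'] = stats['mean'] + 1.96 * stats['sem']
print(stats)
```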

In [62]:
plt.figure(figsize=(17,7))

sns.lineplot(x='Date', y='Drug Overdose Deaths', data=final_border)
Out[62]:
<Axes: xlabel='Date', ylabel='Drug Overdose Deaths'>

I utilized seaborn's boxplot() to visualize and understand the outliers in drug overdose deaths per year. The line in the middle of each box is the median; the box shows how spread out the data is around the median each year, and the points beyond the whiskers are outliers.

In [63]:
plt.figure(figsize=(17,9))

final_border['Year'] = final_border['Date'].dt.year

sns.boxplot(x='Year', y='Drug Overdose Deaths', width=0.5, data=final_border)
Out[63]:
<Axes: xlabel='Year', ylabel='Drug Overdose Deaths'>
In [64]:
plt.figure(figsize=(17,8))

sns.barplot(x='State', y='Drug Overdose Deaths', data=final_border)
Out[64]:
<Axes: xlabel='State', ylabel='Drug Overdose Deaths'>

Overview of entering vehicle frequency at the U.S. Mexico and U.S. Canada border.

In [65]:
final_border['Vehicle Total'].describe()
Out[65]:
count    1.456000e+03
mean     8.207859e+05
std      1.169088e+06
min      3.060000e+02
25%      7.565325e+04
50%      1.819060e+05
75%      1.204970e+06
max      4.759977e+06
Name: Vehicle Total, dtype: float64

The graph below is a kernel density plot. I chose it because it's smoother than a histogram, avoids putting values into bins, and needs no extra plot elements. The x-axis is in millions; monthly vehicle totals are concentrated at the low end, well below the mean of roughly 820,000, in line with the statistics above. The density tapers off around 1.5 million and shows a small bump around 4 million per month. Overlaid with the drug overdose distribution above, the two shapes would look very similar.

In [66]:
vehicle_freq = final_border['Vehicle Total']

plt.figure(figsize=(17,8))

sns.kdeplot(vehicle_freq)
Out[66]:
<Axes: xlabel='Vehicle Total', ylabel='Density'>

The bar plot below shows total vehicle frequency by state. Most vehicle traffic passes through the U.S.-Mexico border rather than the U.S.-Canada border, with California and Texas showing the highest vehicle frequency.

In [67]:
plt.figure(figsize=(17,8))

plt.xticks(rotation=45)

sns.barplot(x='State', y='Vehicle Total', data=final_border)
Out[67]:
<Axes: xlabel='State', ylabel='Vehicle Total'>

The line plot below shows that vehicle frequency was consistent from 2015 to 2019: it would peak during the summer months and dip back down during the winter. That consistency lasted until early 2020, when the Covid-19 lockdowns closed the borders with both Mexico and Canada. Vehicle frequency has slowly picked back up since then; around mid-2022 it returned to pre-Covid-19 levels.

In [68]:
plt.figure(figsize=(17,7))

sns.lineplot(x='Date', y='Vehicle Total', data=final_border)
Out[68]:
<Axes: xlabel='Date', ylabel='Vehicle Total'>

Boxenplots are a great way to visualize outliers in per-state vehicle frequency.

In [69]:
plt.figure(figsize=(17,9))

sns.boxenplot(data=final_border, x="Vehicle Total", y="State")
Out[69]:
<Axes: xlabel='Vehicle Total', ylabel='State'>

Overview of drug overdose deaths and entering vehicle frequency at the U.S.-Mexico and U.S.-Canada borders, together.

Scaling is required for the two columns I want to visualize together: the average 'Vehicle Total' is about 820,000, while the average 'Drug Overdose Deaths' is about 1,599. MinMaxScaler rescales each column to the [0, 1] range so the two can be visualized and modeled on a common scale. The scaled dataframe will also be used for correlation and regression later.
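Under the hood, MinMaxScaler applies (x - min) / (max - min) per column. A quick check of the sklearn result against the hand formula on toy values (not the real columns):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

vals = np.array([[306.0], [181906.0], [4759977.0]])  # toy vehicle totals

# sklearn's transform...
scaled = MinMaxScaler().fit_transform(vals)

# ...matches the formula applied directly.
manual = (vals - vals.min()) / (vals.max() - vals.min())

print(np.allclose(scaled, manual))
```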

In [70]:
# Defining the scaler.

scaler = MinMaxScaler()

# Fitting the scaler and transforming two columns.

final_border[['Vehicle Total', 'Drug Overdose Deaths']] = scaler.fit_transform(final_border[['Vehicle Total', 
                                                                                             'Drug Overdose Deaths']])

final_border.sample(5)
Out[70]:
State Date Vehicle Total Drug Overdose Deaths Year
531 Michigan 2015-12-01 0.250945 0.158339 2015
1425 Washington 2021-02-01 0.021662 0.145103 2021
54 Alaska 2019-07-01 0.025978 0.006287 2019
579 Michigan 2019-12-01 0.251605 0.190189 2019
42 Alaska 2018-07-01 0.024509 0.004715 2018

After scaling both the Vehicle Total and Drug Overdose Deaths columns, I wanted to visualize them together. Below is a KDE plot of both scaled columns, similar to the individual plots above.

In [71]:
plt.figure(figsize=(17,8))

scaled_DrugOD_column = final_border['Drug Overdose Deaths']
scaled_vehicle_total = final_border['Vehicle Total']

sns.kdeplot(scaled_DrugOD_column, fill=True)
sns.kdeplot(scaled_vehicle_total, fill=True)
Out[71]:
<Axes: xlabel='Drug Overdose Deaths', ylabel='Density'>

Visualizing both Drug Overdose Deaths and Vehicle Frequency over time, from 2015 through mid-2023.

In [72]:
plt.figure(figsize=(17,7))

# Using line plots and removing the confidence interval.

sns.set_style('darkgrid')

sns.lineplot(x='Date', y='Drug Overdose Deaths', data=final_border, errorbar=None, color='green')
sns.lineplot(x='Date', y='Vehicle Total', data=final_border, errorbar=None, color='orange')
Out[72]:
<Axes: xlabel='Date', ylabel='Drug Overdose Deaths'>

Visualizing a scatterplot by state for both drug overdose deaths and vehicle frequency.

In [73]:
plt.figure(figsize=(17,9))

sns.scatterplot(data=final_border, x="Drug Overdose Deaths", y="Vehicle Total", hue="State")
Out[73]:
<Axes: xlabel='Drug Overdose Deaths', ylabel='Vehicle Total'>

A zoomed-in view of the bottom-left corner of the plot above.

In [74]:
plt.figure(figsize=(17,9))

sns.scatterplot(data=final_border, x="Drug Overdose Deaths", y="Vehicle Total", hue="State")

plt.xlim(0.0,0.4)
plt.ylim(0.0,0.4)
Out[74]:
(0.0, 0.4)

2. Visualizing Correlation

Correlation measures the strength of the relationship between two variables on a scale from -1 (perfect negative relationship) to 1 (perfect positive relationship). Below I computed the correlation between drug overdose deaths and vehicle frequency; it comes out at about 0.8! That is a strong positive relationship: as vehicle totals rise, drug overdose deaths tend to rise as well.
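Pearson's r (what .corr() computes by default) is the covariance of the two series divided by the product of their standard deviations. It can be reproduced by hand on synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
x = pd.Series(rng.normal(size=200))
y = 0.8 * x + pd.Series(rng.normal(scale=0.6, size=200))  # correlated by construction

# pandas' built-in Pearson correlation...
r_pandas = x.corr(y)

# ...equals cov(x, y) / (std(x) * std(y)) computed directly.
r_manual = ((x - x.mean()) * (y - y.mean())).mean() / (x.std(ddof=0) * y.std(ddof=0))

print(round(r_pandas, 6), round(r_manual, 6))
```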

In [75]:
correlation = final_border['Vehicle Total'].corr(final_border['Drug Overdose Deaths'])

print(correlation)
0.8009629464114575
In [76]:
sns.lmplot(data=final_border, x="Drug Overdose Deaths", y="Vehicle Total", height=7)
Out[76]:
<seaborn.axisgrid.FacetGrid at 0x18790149bb0>

Hypothesis Test

Null Hypothesis:

$H_{0}$: The number of vehicles passing through the most frequented ports in each state has no relationship with the occurrence of overdose deaths in U.S.-Mexico and U.S.-Canada border states.

Alternative Hypothesis:

$H_{A}$: The more vehicles that pass through the most frequented ports in each state, the greater the occurrence of overdose deaths in U.S.-Mexico and U.S.-Canada border states.

Planned Statistical Test:

Linear Regression

Metrics generated from the planned statistical test:

T-statistic, P-Value

Alpha Value (threshold):

$a$: 0.05

If the p-value is < 0.05, we reject the null hypothesis.

If the p-value is > 0.05, we fail to reject the null hypothesis.

In [77]:
# Defining the independent and dependent variables, using only the two columns needed here.

x = final_border['Vehicle Total']
y = final_border['Drug Overdose Deaths']

# Adding a constant.

x_const = sm.add_constant(x)

# fit the model

regression_model = sm.OLS(y, x_const).fit()

# printing summary statistics

print(regression_model.summary())
                             OLS Regression Results                             
================================================================================
Dep. Variable:     Drug Overdose Deaths   R-squared:                       0.642
Model:                              OLS   Adj. R-squared:                  0.641
Method:                   Least Squares   F-statistic:                     2602.
Date:                  Sun, 24 Mar 2024   Prob (F-statistic):               0.00
Time:                          01:53:50   Log-Likelihood:                 1224.2
No. Observations:                  1456   AIC:                            -2444.
Df Residuals:                      1454   BIC:                            -2434.
Df Model:                             1                                         
Covariance Type:              nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const             0.0294      0.003      8.784      0.000       0.023       0.036
Vehicle Total     0.5687      0.011     51.012      0.000       0.547       0.591
==============================================================================
Omnibus:                      591.803   Durbin-Watson:                   0.049
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3587.317
Skew:                           1.788   Prob(JB):                         0.00
Kurtosis:                       9.808   Cond. No.                         4.20
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The p-value is below the 0.05 threshold (reported as 0.000), so we reject the null hypothesis: there is a strong relationship between vehicle frequency and drug overdose deaths, and it is unlikely to be due to chance. The data support the alternative hypothesis.

In [78]:
# Trying plotly for the first time. I should have been using this all along.

fig = px.scatter(final_border, x="Drug Overdose Deaths", y="Vehicle Total", trendline="ols", trendline_color_override="orange")
fig.show()
In [ ]: